DOI: 10.1201/9781003355205-2

Chapter 2

Mapping of Sequence Reads to the Reference Genomes

2.1  INTRODUCTION TO SEQUENCE MAPPING

So far, we have gone through the first two steps of NGS/HTP data analysis,

namely acquiring the raw data in FASTQ format and performing read quality control. By this

point, you know that the raw sequencing data must be cleaned of errors and artifacts,

as much as possible, before moving on to the next step of the data analysis. This chapter

discusses the alignment of reads (short or long) to a reference genome of an organism. This

step is crucial for most sequencing applications, including reference-guided genome

assembly, variant discovery, gene expression (RNA-Seq), epigenetics (ChIP-Seq, Methyl-Seq),

and metagenomics (targeted and shotgun). The reference genome sequence of an

organism is a key element of read alignment or mapping. Scientists have devoted an enormous

amount of time and effort to determining the genome sequences of many organisms. Complete

genomes of hundreds of organisms have already been sequenced and the list continues to

grow. The sequencing of the human genome was completed in 2003 by the Human Genome Project,

an international effort in which the National Human Genome Research Institute (NHGRI) played a

leading role. Alongside it, the genomes of a variety of model organisms that serve as surrogates in

studying human biology were sequenced, followed by the genomes of numerous other organisms,

including some extinct ones such as Neanderthals. The first sequenced genomes of model organisms include the rat, puffer

fish, fruit fly, sea squirt, roundworm, and the bacterium Escherichia coli. The NHGRI has

supported the sequencing of numerous species with the aim of providing data for understanding genetic

variation among organisms. Genome sequences are available in sequence databases funded

by governments and supported by institutions. A reference genome sequence of an organism

is a curated sequence that serves as a representative genome for the individuals of that organism.

However, the genomes of individuals vary, and the reference sequence is only the

sequence against which other sequences are compared. These days, there are reference genomes for

thousands of organisms, including animals, plants, fungi, bacteria, archaea, and viruses,